QA dataset
- Europe > Belgium > Brussels-Capital Region > Brussels (0.14)
- Europe > Romania (0.04)
- Europe > United Kingdom > England (0.04)
- (19 more...)
- Health & Medicine > Therapeutic Area > Endocrinology (1.00)
- Education (1.00)
- Banking & Finance (0.92)
- (3 more...)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Hong Kong (0.04)
- Asia > British Indian Ocean Territory > Diego Garcia (0.04)
- Asia > South Korea > Seoul > Seoul (0.04)
- Asia > Singapore (0.04)
- Asia > Middle East > Israel (0.04)
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Health & Medicine > Health Care Technology (0.69)
- Health & Medicine > Nuclear Medicine (0.68)
Consistency Is the Key: Detecting Hallucinations in LLM Generated Text By Checking Inconsistencies About Key Facts
Gupta, Raavi, Panicker, Pranav Hari, Bhatia, Sumit, Ramakrishnan, Ganesh
Large language models (LLMs), despite their remarkable text generation capabilities, often hallucinate, generating text that is factually incorrect and not grounded in real-world knowledge. This poses serious risks in domains like healthcare, finance, and customer support. A typical way to use LLMs is via APIs provided by LLM vendors, where there is no access to model weights or options to fine-tune the model. Existing methods to detect hallucinations in such settings, where model access is restricted or constrained by resources, typically require multiple LLM API calls, increasing latency and API cost. We introduce CONFACTCHECK, an efficient hallucination detection approach that does not rely on any external knowledge base and builds on the simple intuition that responses to factual probes within the generated text should be consistent within a single LLM and across different LLMs. Rigorous empirical evaluation on multiple datasets covering both factual and open-ended text generation shows that CONFACTCHECK detects hallucinated facts efficiently, using fewer resources and achieving higher accuracy than existing baselines operating under similar conditions. Our code is available here.
- Europe (1.00)
- North America > United States > California (0.28)
- Research Report (1.00)
- Overview > Fact Book (0.43)
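The consistency intuition behind CONFACTCHECK can be sketched as follows: the same factual probe is answered several times (within one LLM or across LLMs), and low agreement among the answers flags a likely hallucination. The function names, the majority-vote scoring, and the 0.7 threshold below are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

def consistency_score(answers):
    """Fraction of probe answers that agree with the majority answer."""
    counts = Counter(a.strip().lower() for a in answers)
    majority = counts.most_common(1)[0][1]
    return majority / len(answers)

def flag_hallucination(answers, threshold=0.7):
    """Flag a fact as likely hallucinated when repeated probes disagree."""
    return consistency_score(answers) < threshold
```

In practice each element of `answers` would come from a separate LLM call probing the same fact extracted from the generated text; here the calls are assumed to have already happened.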
A Multifaceted Analysis of Negative Bias in Large Language Models through the Lens of Parametric Knowledge
Song, Jongyoon, Yu, Sangwon, Yoon, Sungroh
Negative bias refers to the tendency of large language models (LLMs) to excessively generate negative responses in binary decision tasks (e.g., yes-no question answering). Previous research has focused on detecting and addressing negative attention heads that induce negative bias. However, the underlying factors influencing negative bias remain underexplored. In this paper, we demonstrate that LLMs exhibit format-level negative bias, meaning that the prompt format influences their responses more than the semantics of the negative response does. For a fine-grained study of negative bias, we introduce a pipeline for constructing an evaluation set that systematically categorizes the dataset into three subsets based on the model's parametric knowledge: correct, incorrect, and insufficient relevant knowledge. Through analysis of this evaluation set, we identify a shortcut behavior in which models tend to generate negative responses when they lack sufficient knowledge to answer a yes-no question, leading to negative bias. We further examine how negative bias changes under various prompting scenarios related to parametric knowledge. We observe that providing relevant context and offering an "I don't know" option generally reduce negative bias, whereas chain-of-thought prompting tends to amplify it. Finally, we demonstrate that the degree of negative bias can vary depending on the type of prompt, which influences the direction of the response. Our work reveals the various factors that influence negative bias, providing critical insights for mitigating it in LLMs.
Recent advances in the capabilities and emergent abilities of large language models (LLMs) have led to rapid improvements in performance across a wide range of natural language processing (NLP) tasks [1]-[5]. Leveraging their ability to follow instructions, LLMs are able to perform complex, previously unseen tasks, enabling human-like interactions [6]-[9].
One critical issue is the hallucination problem, where the model generates content containing misleading information that does not correspond to the given context or to real-world knowledge [11].
J. Song was with the Department of Electrical and Computer Engineering, Seoul National University, South Korea (coms1580@gmail.com).
- North America > United States > Minnesota (0.28)
- Asia > South Korea > Seoul > Seoul (0.24)
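The bias measurement described above can be sketched as a simple "no"-rate statistic over a balanced yes-no set, computed separately for the correct / incorrect / insufficient-knowledge subsets. The crude string matching and the subset grouping here are illustrative assumptions, not the paper's actual pipeline.

```python
def negative_rate(responses):
    """Share of responses answering 'no'; ~0.5 is expected on a balanced yes-no set.

    Matching on the 'no' prefix is deliberately crude; a real pipeline would
    normalize answers more carefully.
    """
    negatives = sum(1 for r in responses if r.strip().lower().startswith("no"))
    return negatives / len(responses)

def bias_by_subset(labeled):
    """Negative rate per knowledge subset ('correct', 'incorrect', 'insufficient')."""
    return {subset: negative_rate(responses) for subset, responses in labeled.items()}
```

A negative rate well above 0.5 on the insufficient-knowledge subset would reproduce the shortcut behavior the paper reports.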
DeepSpecs: Expert-Level Questions Answering in 5G
Manvattira, Aman Ganapathy, Xu, Yifei, Dang, Ziyue, Lu, Songwu
5G technology enables mobile Internet access for billions of users. Answering expert-level questions about 5G specifications requires navigating thousands of pages of cross-referenced standards that evolve across releases. Existing retrieval-augmented generation (RAG) frameworks, including telecom-specific approaches, rely on semantic similarity and cannot reliably resolve cross-references or reason about specification evolution. We present DeepSpecs, a RAG system enhanced by structural and temporal reasoning via three metadata-rich databases: SpecDB (clause-aligned specification text), ChangeDB (line-level version diffs), and TDocDB (standardization meeting documents). DeepSpecs explicitly resolves cross-references by recursively retrieving referenced clauses through metadata lookup, and traces specification evolution by mining changes and linking them to Change Requests that document design rationale. We curate two 5G QA datasets: 573 expert-annotated real-world questions from practitioner forums and educational resources, and 350 evolution-focused questions derived from approved Change Requests. Across multiple LLM backends, DeepSpecs outperforms base models and state-of-the-art telecom RAG systems; ablations confirm that explicit cross-reference resolution and evolution-aware retrieval substantially improve answer quality, underscoring the value of modeling the structural and temporal properties of 5G standards.
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
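The explicit cross-reference resolution described above can be sketched as a bounded recursive lookup over a clause-keyed store. The dict-based SpecDB schema, the depth bound, and the function name are assumptions for illustration, not DeepSpecs' actual interfaces.

```python
def resolve_clause(clause_id, spec_db, cross_refs, depth=2, seen=None):
    """Collect a clause plus the clauses it cross-references, recursively.

    spec_db maps clause id -> clause text; cross_refs maps clause id -> referenced ids.
    `depth` bounds the recursion and `seen` guards against reference cycles,
    which are common in cross-referenced standards.
    """
    if seen is None:
        seen = set()
    if clause_id in seen or depth < 0 or clause_id not in spec_db:
        return []
    seen.add(clause_id)
    retrieved = [(clause_id, spec_db[clause_id])]
    for ref in cross_refs.get(clause_id, []):
        retrieved.extend(resolve_clause(ref, spec_db, cross_refs, depth - 1, seen))
    return retrieved
```

The metadata lookup replaces similarity search: a referenced clause is fetched because it is named, not because its text happens to resemble the query.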
Benchmarking GPT-5 for biomedical natural language processing
Hou, Yu, Zhan, Zaifu, Zeng, Min, Wu, Yifan, Zhou, Shuang, Zhang, Rui
Biomedical literature and clinical narratives pose multifaceted challenges for natural language understanding, from precise entity extraction and document synthesis to multi-step diagnostic reasoning. This study extends a unified benchmark to evaluate GPT-5 and GPT-4o under zero-, one-, and five-shot prompting across five core biomedical NLP tasks (named entity recognition, relation extraction, multi-label document classification, summarization, and simplification) and nine expanded biomedical QA datasets covering factual knowledge, clinical reasoning, and multimodal visual understanding. Using standardized prompts, fixed decoding parameters, and consistent inference pipelines, we assessed model performance, latency, and token-normalized cost under official pricing. GPT-5 consistently outperformed GPT-4o, with the largest gains on reasoning-intensive datasets such as MedXpertQA and DiagnosisArena and stable improvements in multimodal QA. In core tasks, GPT-5 achieved better chemical NER and ChemProt scores but remained below domain-tuned baselines for disease NER and summarization. Despite producing longer outputs, GPT-5 showed comparable latency and 30 to 50 percent lower effective cost per correct prediction. Fine-grained analyses revealed improvements in diagnosis, treatment, and reasoning subtypes, whereas boundary-sensitive extraction and evidence-dense summarization remain challenging. Overall, GPT-5 approaches deployment-ready performance for biomedical QA while offering a favorable balance of accuracy, interpretability, and economic efficiency. The results support a tiered prompting strategy: direct prompting for large-scale or cost-sensitive applications, and chain-of-thought scaffolds for analytically complex or high-stakes scenarios, highlighting the continued need for hybrid solutions where precision and factual fidelity are critical.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.46)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- Europe > Switzerland > Basel-City > Basel (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.68)
- Health & Medicine > Therapeutic Area (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (0.46)
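The "cost per correct prediction" comparison above reduces to a small calculation. The per-million-token pricing convention below is an assumption about how the official pricing is applied; the function is a sketch, not the study's evaluation code.

```python
def cost_per_correct(n_correct, prompt_tokens, completion_tokens,
                     price_in_per_m, price_out_per_m):
    """Effective API cost per correct prediction.

    Prices are quoted per 1M tokens (a common vendor convention, assumed here).
    A model with longer outputs can still win if its accuracy gain outpaces
    its extra token spend.
    """
    total_cost = (prompt_tokens / 1e6) * price_in_per_m \
               + (completion_tokens / 1e6) * price_out_per_m
    return total_cost / n_correct
```
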
Exploring Models and Data for Image Question Answering
Mengye Ren, Ryan Kiros, Richard Zemel
This work aims to address the problem of image-based question-answering (QA) with new models and datasets. In our work, we propose to use neural networks and visual semantic embeddings, without intermediate stages such as object detection and image segmentation, to predict answers to simple questions about images. Our model performs 1.8 times better than the only published results on an existing image QA dataset. We also present a question generation algorithm that converts image descriptions, which are widely available, into QA form. We used this algorithm to produce an order-of-magnitude larger dataset, with more evenly distributed answers. A suite of baseline results on this new dataset is also presented.
- North America > Canada > Ontario > Toronto (0.15)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
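The caption-to-QA conversion described above can be illustrated with a deliberately naive object-question rewrite. The paper's actual algorithm is more sophisticated (it also generates number, color, and location questions), so treat this purely as a sketch of the idea.

```python
def object_question(caption):
    """Turn 'a/an/the <noun> <phrase>' into ('what is <phrase>?', <noun>).

    A toy rule: the word after the leading article is assumed to be the object
    noun, and the remainder of the caption becomes the question body.
    """
    words = caption.lower().rstrip(".").split()
    if len(words) < 3 or words[0] not in {"a", "an", "the"}:
        return None  # caption shape not handled by this toy rule
    noun, rest = words[1], " ".join(words[2:])
    return f"what is {rest}?", noun
```

Applied over a large caption corpus, even a rule this simple yields QA pairs at scale, which is the leverage the paper's generation algorithm exploits.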